Body Mass Index (BMI), age, height and weight are important indicators of human health conditions, which can provide useful information for plenty of practical purposes, such as health care, monitoring and re-identification. Most existing methods of health indicator prediction mainly use front-view body or face images. These inputs are hard to be obtained in daily life and often lead to the lack of robustness for the models, considering their strict requirements on view and pose. In this paper, we propose to employ gait videos to predict health indicators, which are more prevalent in surveillance and home monitoring scenarios. However, the study of health indicator prediction from gait videos using deep learning was hindered due to the small amount of open-sourced data. To address this issue, we analyse the similarity and relationship between pose estimation and health indicator prediction tasks, and then propose a paradigm enabling deep learning for small health indicator datasets by pre-training on the pose estimation task. Furthermore, to better suit the health indicator prediction task, we bring forward Global-Local Aware aNd Centrosymmetric Encoder (GLANCE) module. It first extracts local and global features by progressive convolutions and then fuses multi-level features by a centrosymmetric double-path hourglass structure in two different ways. Experiments demonstrate that the proposed paradigm achieves state-of-the-art results for predicting health indicators on MoVi, and that the GLANCE module is also beneficial for pose estimation on 3DPW.
translated by 谷歌翻译
Explainability of Graph Neural Networks (GNNs) is critical to various GNN applications but remains an open challenge. A convincing explanation should be both necessary and sufficient simultaneously. However, existing GNN explaining approaches focus on only one of the two aspects, necessity or sufficiency, or a trade-off between the two. To search for the most necessary and sufficient explanation, the Probability of Necessity and Sufficiency (PNS) can be applied since it can mathematically quantify the necessity and sufficiency of an explanation. Nevertheless, the difficulty of obtaining PNS due to non-monotonicity and the challenge of counterfactual estimation limits its wide use. To address the non-identifiability of PNS, we resort to a lower bound of PNS that can be optimized via counterfactual estimation, and propose Necessary and Sufficient Explanation for GNN (NSEG) via optimizing that lower bound. Specifically, we employ nearest neighbor matching to generate counterfactual samples for the features, which is different from the random perturbation. In particular, NSEG combines the edges and node features to generate an explanation, where the common edge explanation is a special case of the combined explanation. Empirical study shows that NSEG achieves excellent performance in generating the most necessary and sufficient explanations among a series of state-of-the-art methods.
translated by 谷歌翻译
图表可以模拟实体之间的复杂交互,它在许多重要的应用程序中自然出现。这些应用程序通常可以投入到标准图形学习任务中,其中关键步骤是学习低维图表示。图形神经网络(GNN)目前是嵌入方法中最受欢迎的模型。然而,邻域聚合范例中的标准GNN患有区分\ EMPH {高阶}图形结构的有限辨别力,而不是\ EMPH {低位}结构。为了捕获高阶结构,研究人员求助于主题和开发的基于主题的GNN。然而,现有的基于主基的GNN仍然仍然遭受较少的辨别力的高阶结构。为了克服上述局限性,我们提出了一个新颖的框架,以更好地捕获高阶结构的新框架,铰接于我们所提出的主题冗余最小化操作员和注射主题组合的新颖框架。首先,MGNN生成一组节点表示W.R.T.每个主题。下一阶段是我们在图案中提出的冗余最小化,该主题在彼此相互比较并蒸馏出每个主题的特征。最后,MGNN通过组合来自不同图案的多个表示来执行节点表示的更新。特别地,为了增强鉴别的功率,MGNN利用重新注射功能来组合表示的函数w.r.t.不同的主题。我们进一步表明,我们的拟议体系结构增加了GNN的表现力,具有理论分析。我们展示了MGNN在节点分类和图形分类任务上的七个公共基准上表现出最先进的方法。
translated by 谷歌翻译
卷积神经网络(CNN)已在许多物联网(IoT)设备中应用于多种下游任务。但是,随着边缘设备上的数据量的增加,CNN几乎无法及时完成某些任务,而计算和存储资源有限。最近,过滤器修剪被认为是压缩和加速CNN的有效技术,但是从压缩高维张量的角度来看,现有的方法很少是修剪CNN。在本文中,我们提出了一种新颖的理论,可以在三维张量中找到冗余信息,即量化特征图(QSFM)之间的相似性,并利用该理论来指导滤波器修剪过程。我们在数据集(CIFAR-10,CIFAR-100和ILSVRC-12)上执行QSFM和Edge设备,证明所提出的方法可以在神经网络中找到冗余信息,具有可比的压缩和可耐受的准确性下降。没有任何微调操作,QSFM可以显着压缩CIFAR-56(48.7%的Flops和57.9%的参数),而TOP-1的准确性仅损失0.54%。对于边缘设备的实际应用,QSFM可以将Mobilenet-V2推理速度加速1.53倍,而ILSVRC-12 TOP-1的精度仅损失1.23%。
translated by 谷歌翻译
Unsupervised domain adaptation (UDA) for semantic segmentation is a promising task freeing people from heavy annotation work. However, domain discrepancies in low-level image statistics and high-level contexts compromise the segmentation performance over the target domain. A key idea to tackle this problem is to perform both image-level and feature-level adaptation jointly. Unfortunately, there is a lack of such unified approaches for UDA tasks in the existing literature. This paper proposes a novel UDA pipeline for semantic segmentation that unifies image-level and feature-level adaptation. Concretely, for image-level domain shifts, we propose a global photometric alignment module and a global texture alignment module that align images in the source and target domains in terms of image-level properties. For feature-level domain shifts, we perform global manifold alignment by projecting pixel features from both domains onto the feature manifold of the source domain; and we further regularize category centers in the source domain through a category-oriented triplet loss and perform target domain consistency regularization over augmented target domain images. Experimental results demonstrate that our pipeline significantly outperforms previous methods. In the commonly tested GTA5$\rightarrow$Cityscapes task, our proposed method using Deeplab V3+ as the backbone surpasses previous SOTA by 8%, achieving 58.2% in mIoU.
translated by 谷歌翻译
Different people speak with diverse personalized speaking styles. Although existing one-shot talking head methods have made significant progress in lip sync, natural facial expressions, and stable head motions, they still cannot generate diverse speaking styles in the final talking head videos. To tackle this problem, we propose a one-shot style-controllable talking face generation framework. In a nutshell, we aim to attain a speaking style from an arbitrary reference speaking video and then drive the one-shot portrait to speak with the reference speaking style and another piece of audio. Specifically, we first develop a style encoder to extract dynamic facial motion patterns of a style reference video and then encode them into a style code. Afterward, we introduce a style-controllable decoder to synthesize stylized facial animations from the speech content and style code. In order to integrate the reference speaking style into generated videos, we design a style-aware adaptive transformer, which enables the encoded style code to adjust the weights of the feed-forward layers accordingly. Thanks to the style-aware adaptation mechanism, the reference speaking style can be better embedded into synthesized videos during decoding. Extensive experiments demonstrate that our method is capable of generating talking head videos with diverse speaking styles from only one portrait image and an audio clip while achieving authentic visual effects. Project Page: https://github.com/FuxiVirtualHuman/styletalk.
translated by 谷歌翻译
Witnessing the impressive achievements of pre-training techniques on large-scale data in the field of computer vision and natural language processing, we wonder whether this idea could be adapted in a grab-and-go spirit, and mitigate the sample inefficiency problem for visuomotor driving. Given the highly dynamic and variant nature of the input, the visuomotor driving task inherently lacks view and translation invariance, and the visual input contains massive irrelevant information for decision making, resulting in predominant pre-training approaches from general vision less suitable for the autonomous driving task. To this end, we propose PPGeo (Policy Pre-training via Geometric modeling), an intuitive and straightforward fully self-supervised framework curated for the policy pretraining in visuomotor driving. We aim at learning policy representations as a powerful abstraction by modeling 3D geometric scenes on large-scale unlabeled and uncalibrated YouTube driving videos. The proposed PPGeo is performed in two stages to support effective self-supervised training. In the first stage, the geometric modeling framework generates pose and depth predictions simultaneously, with two consecutive frames as input. In the second stage, the visual encoder learns driving policy representation by predicting the future ego-motion and optimizing with the photometric error based on current visual observation only. As such, the pre-trained visual encoder is equipped with rich driving policy related representations and thereby competent for multiple visuomotor driving tasks. Extensive experiments covering a wide span of challenging scenarios have demonstrated the superiority of our proposed approach, where improvements range from 2% to even over 100% with very limited data. Code and models will be available at https://github.com/OpenDriveLab/PPGeo.
translated by 谷歌翻译
Increasing research interests focus on sequential recommender systems, aiming to model dynamic sequence representation precisely. However, the most commonly used loss function in state-of-the-art sequential recommendation models has essential limitations. To name a few, Bayesian Personalized Ranking (BPR) loss suffers the vanishing gradient problem from numerous negative sampling and predictionbiases; Binary Cross-Entropy (BCE) loss subjects to negative sampling numbers, thereby it is likely to ignore valuable negative examples and reduce the training efficiency; Cross-Entropy (CE) loss only focuses on the last timestamp of the training sequence, which causes low utilization of sequence information and results in inferior user sequence representation. To avoid these limitations, in this paper, we propose to calculate Cumulative Cross-Entropy (CCE) loss over the sequence. CCE is simple and direct, which enjoys the virtues of painless deployment, no negative sampling, and effective and efficient training. We conduct extensive experiments on five benchmark datasets to demonstrate the effectiveness and efficiency of CCE. The results show that employing CCE loss on three state-of-the-art models GRU4Rec, SASRec, and S3-Rec can reach 125.63%, 69.90%, and 33.24% average improvement of full ranking NDCG@5, respectively. Using CCE, the performance curve of the models on the test data increases rapidly with the wall clock time, and is superior to that of other loss functions in almost the whole process of model training.
translated by 谷歌翻译
Recent advances in self-supervised learning (SSL) in computer vision are primarily comparative, whose goal is to preserve invariant and discriminative semantics in latent representations by comparing siamese image views. However, the preserved high-level semantics do not contain enough local information, which is vital in medical image analysis (e.g., image-based diagnosis and tumor segmentation). To mitigate the locality problem of comparative SSL, we propose to incorporate the task of pixel restoration for explicitly encoding more pixel-level information into high-level semantics. We also address the preservation of scale information, a powerful tool in aiding image understanding but has not drawn much attention in SSL. The resulting framework can be formulated as a multi-task optimization problem on the feature pyramid. Specifically, we conduct multi-scale pixel restoration and siamese feature comparison in the pyramid. In addition, we propose non-skip U-Net to build the feature pyramid and develop sub-crop to replace multi-crop in 3D medical imaging. The proposed unified SSL framework (PCRLv2) surpasses its self-supervised counterparts on various tasks, including brain tumor segmentation (BraTS 2018), chest pathology identification (ChestX-ray, CheXpert), pulmonary nodule detection (LUNA), and abdominal organ segmentation (LiTS), sometimes outperforming them by large margins with limited annotations.
translated by 谷歌翻译
The visual dimension of cities has been a fundamental subject in urban studies, since the pioneering work of scholars such as Sitte, Lynch, Arnheim, and Jacobs. Several decades later, big data and artificial intelligence (AI) are revolutionizing how people move, sense, and interact with cities. This paper reviews the literature on the appearance and function of cities to illustrate how visual information has been used to understand them. A conceptual framework, Urban Visual Intelligence, is introduced to systematically elaborate on how new image data sources and AI techniques are reshaping the way researchers perceive and measure cities, enabling the study of the physical environment and its interactions with socioeconomic environments at various scales. The paper argues that these new approaches enable researchers to revisit the classic urban theories and themes, and potentially help cities create environments that are more in line with human behaviors and aspirations in the digital age.
translated by 谷歌翻译